
fix: make tool call assertions flexible to prevent CI flakes#5295

Open
iamemilio wants to merge 2 commits into llamastack:main from iamemilio:fix/flexible-tool-call-assertions

Conversation

@iamemilio
Contributor

@iamemilio iamemilio commented Mar 25, 2026

Summary

  • Loosens strict len(response.output) == 1 assertions in tool call integration tests to accept one or more function calls (>= 1), since models may legitimately return multiple parallel tool calls for a single-tool prompt
  • Updates follow-up turns to respond to all returned tool calls (not just the first), preventing the tool_call_id not responded to API error when the model produces duplicate invocations
  • These changes make tests resilient to valid model behavior variations across different providers and model versions without weakening what is actually being tested
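The assertion change described above can be sketched as two small helpers. This is a minimal illustration, not the PR's actual diff: the `Item` dataclass stands in for OpenAI-style Responses API output items, and `function_calls` / `tool_outputs` are hypothetical helper names.

```python
# Sketch of the loosened assertion pattern (assumption: output items
# expose .type and .call_id like the OpenAI Responses API does).
from dataclasses import dataclass

@dataclass
class Item:
    type: str
    call_id: str = ""
    name: str = ""

def function_calls(output):
    """Collect every function_call item instead of assuming exactly one."""
    calls = [item for item in output if item.type == "function_call"]
    assert len(calls) >= 1  # one call still passes; parallel calls now do too
    return calls

def tool_outputs(calls, result: str):
    """Answer every call_id, so no tool call is left unresponded."""
    return [
        {"type": "function_call_output", "call_id": c.call_id, "output": result}
        for c in calls
    ]
```

For example, if a model eagerly emits two parallel `get_weather` calls, `tool_outputs` produces one `function_call_output` per `call_id`, which avoids the "tool_call_id not responded to" error on the follow-up turn.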

Motivation

When evaluating cheaper model alternatives for the Azure CI suite, we observed that some models (e.g. gpt-4.1-nano) occasionally return two parallel get_weather function calls instead of one. This is a valid API response — the model is correctly identifying the tool to call, just being eager about parallelism. The previous == 1 assertions treated this as a failure, creating intermittent CI flakes.

The fix is backwards-compatible: tests still pass when a model returns exactly one call (since 1 >= 1), while also accepting the multi-call case. The core behavior being verified (correct tool invocation, successful follow-up round-trip, final message with text) remains unchanged.

Test plan

  • Ran the full responses suite with gpt-4.1-nano via OpenAI — 196 passed, 0 failed, 26 skipped
  • Verified the previously flaky tests (test_function_call_output_list_text, test_function_call_output_list_text_multi_block, test_response_non_streaming_custom_tool) pass consistently with the loosened assertions
  • Assertions remain logically correct — still verify function_call type, status, name, and successful text response after tool output

Some models return multiple parallel tool calls for a single-tool prompt,
which is a valid API response. The previous assertions required exactly one
function call, causing intermittent CI failures when models produced
logically correct but duplicated tool invocations. This relaxes the
assertions to accept one or more function calls and responds to all of
them in follow-up turns, preventing the "tool_call_id not responded to"
error that occurs when only the first call is acknowledged.

Signed-off-by: Emilio Garcia <i.am.emilio@gmail.com>
Made-with: Cursor
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 25, 2026
Collaborator

@mattf mattf left a comment


can you pass parallel_tool_calls=False to create instead?

it'd be good to have a clear parallel_tool_calls=True test too. historically many models would fail it, but we should behave correctly. especially important is parallel_tool_calls=True and stream=True.
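The reviewer's alternative could look roughly like this. It is a sketch under assumptions: the request shape follows the OpenAI Responses API (which accepts a `parallel_tool_calls` flag on create), `client` is a placeholder, and the `get_weather` tool schema is simplified from what the tests actually use.

```python
# Hedged sketch of pinning the model to a single tool call per turn.
# `client` and the exact tool schema are placeholders, not the PR's code.
request_kwargs = dict(
    model="gpt-4.1-nano",
    input="What is the weather in San Francisco?",
    tools=[{
        "type": "function",
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
        },
    }],
    parallel_tool_calls=False,  # forbid duplicate parallel invocations
)
# response = client.responses.create(**request_kwargs)
# With parallel calls disabled, a strict `== 1` assertion is safe again,
# while a separate parallel_tool_calls=True test can keep the `>= 1` check.
```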


3 participants